Python for Data Analysis - Final Project


By Samuel Pariente and Marius Ortega

Here you can enter your name to register as a user of the notebook. Once you have done so, you can add your personal paths in the "Paths" section of the notebook.


user = "samuel"  # values: "marius" or "samuel"

Libraries


In this section we load all the libraries the notebook needs to run properly. Here is the full list of libraries used:

  • Data analysis libraries:

    • Datetime
    • Numpy
    • Pandas
    • Missingno
  • Visualization libraries:

    • Seaborn
    • Matplotlib
    • Bokeh
    • Plotly
    • Wordcloud
    • BarChartRace
  • Machine and deep learning libraries:

    • Sklearn
    • Auto-Sklearn
    • Tensorflow
  • Scraping libraries:

    • Selenium
  • API libraries:

    • Flask
    • Pickle
  • Data storage libraries:

    • Google Colab to Google Drive linkage

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

import seaborn as sns

from matplotlib import pyplot as plt

import matplotlib.colors as plt_colors
import matplotlib.cm as cm
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
from textwrap import wrap

from mpl_toolkits.mplot3d import Axes3D

import numpy as np

from datetime import datetime, timedelta

import plotly.graph_objects as go

from plotly.subplots import make_subplots

from gensim.models import KeyedVectors

from sklearn.decomposition import PCA

from sklearn import preprocessing
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import ARDRegression
from sklearn.model_selection import cross_validate, learning_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score,classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import mutual_info_classif

from yellowbrick.model_selection import LearningCurve

!pip install auto-sklearn
import autosklearn
import autosklearn.classification

import plotly.express as px

from collections import OrderedDict

!pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, BasicTicker, ColorBar, LinearColorMapper, PrintfTickFormatter
from bokeh.plotting import figure, show
from bokeh.transform import factor_cmap, transform
output_notebook()

from bokeh.palettes import (Spectral5, PRGn7, Viridis, Pastel2_7,
                            Purples9, Greens9, YlOrRd9)

from bokeh.palettes import Magma, Magma4, Magma6, Magma7, Magma10

from wordcloud import WordCloud
from PIL import Image

from missingno import matrix

!pip install bar_chart_race
import bar_chart_race as bcr

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Dropout
from tensorflow.keras import optimizers
from tensorflow.keras.utils import to_categorical

import warnings

import pickle
import flask
from flask import Flask, request

1 - Data Pre-processing


In the preprocessing section, our objective is to make the data usable for visualization and modeling. To do so, we have to perform multiple modifications on the dataset. Find the complete explanation below.


  1. Scraping and addition of columns
    • Using Selenium, we add the title and author of each article.
    • We also convert the "timedelta" column to the actual release date of the article and store it in a new column.
    • We merge the "timeline" dataset with the "news" dataset.
    • We create discretized versions of the target column (shares): a binary discretization and a multi-class one.

  2. Cleaning
    • We perform a study of "NA" values and notice that our dataset doesn't contain any trivial "NA" values.
    • Columns 2 to 6 (token related): we verify that the ratio columns lie between 0 and 1. We also check whether each article has content.
    • Columns 13 to 18 (channels): in anticipation of the vectorization, we decode the one-hot encoding and create a new qualitative column.
    • Columns 19 to 27 (keywords): given that the keyword-related columns represent numbers of shares, we verify their non-negativity.
    • Columns 31 to 37 (day of the week): in a similar fashion to the channel columns, we decode the one-hot encoding into a new qualitative column.
    • Columns 39 to 43 (LDA topics): these columns are ratios, so we verify that they all lie between 0 and 1.
    • Columns 61 to 62 (title and author): we must verify the completeness of these scraped columns. We find that 619 rows are "NaN" and drop them, since they represent only a small proportion of the dataset.

  3. Outlier handling
    • We keep only the rows of the dataset for which the "shares" column lies between the first and third quartiles.

  4. Vectorization
    • First, we load a Google vectorization model pre-trained on online articles.
    • Then we vectorize our channel column using this model.
    • However, the output of the model is a 300-dimensional vector, which hampers the learning of models, especially sequential deep learning ones. Thus, we apply a PCA to this output vector and reduce it to a 5-dimensional vector that we can concatenate with our pre-existing dataset.

  5. Creation of the working dataframes
    • We create 2 dataframes to work with.
    • The first one is the visualization dataframe, "v_news". It contains neither one-hot encoded variables nor vectorized ones.
    • The second one, named "m_news" and created for modeling purposes, has only numerical columns, including the vectorized and one-hot encoded ones. In addition, we remove non-predictive columns from it, such as timedelta. Finally, we only keep columns that are less than 70% correlated with the others.
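The one-hot decoding and target discretization described above can be sketched on toy data as follows. The column names mirror the dataset's `data_channel_is_*` layout, but the discretization thresholds and bin labels are purely illustrative, not the project's actual bins:

```python
import pandas as pd

# Toy frame with the same one-hot channel layout as the dataset
df = pd.DataFrame({
    "data_channel_is_tech":  [1, 0, 0],
    "data_channel_is_world": [0, 1, 0],
    "data_channel_is_bus":   [0, 0, 1],
    "shares": [500, 1400, 8000],
})

channel_cols = [c for c in df.columns if c.startswith("data_channel_is_")]

# Decode the one-hot block into a single qualitative column
df["Chanel"] = (df[channel_cols].idxmax(axis=1)
                                .str.replace("data_channel_is_", "", regex=False))

# Binary discretization of the target (threshold is illustrative)
df["popular"] = (df["shares"] > 1400).astype(int)

# Multi-class discretization (bin edges are illustrative)
df["shares_class"] = pd.cut(df["shares"],
                            bins=[0, 1000, 5000, float("inf")],
                            labels=["low", "medium", "high"])
print(df[["Chanel", "popular", "shares_class"]])
```

Using `idxmax` on the one-hot block works here because exactly one indicator is 1 in each row, so the column label of the maximum is the encoded category.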

1.5 - Creation of the working dataframes


export = False

Visualization dataframe

v_news = news.copy()

# Drop the 300-dimensional word-vector columns
vecto_cols = [f"WChanel {i}" for i in range(300)]
v_news.drop(columns=vecto_cols, inplace=True)

# Drop the one-hot encoded channel and weekday columns, along with
# the "url" and "timedelta" columns, which are not used for visualization
one_hot_cols = [
    'data_channel_is_lifestyle', 'data_channel_is_entertainment',
    'data_channel_is_bus', 'data_channel_is_socmed',
    'data_channel_is_tech', 'data_channel_is_world',
    'weekday_is_monday', 'weekday_is_tuesday', 'weekday_is_wednesday',
    'weekday_is_thursday', 'weekday_is_friday', 'weekday_is_saturday',
    'weekday_is_sunday',
]
v_news.drop(columns=one_hot_cols + ['url', 'timedelta'], inplace=True)

Machine and Deep Learning dataframe

m_news = news.copy()

Firstly, we remove columns that won't be used for prediction.


m_news.drop(columns=['Chanel', 'Weekday', 'url', 'date', 'Authors',
                     'Titles', 'timedelta', 'is_weekend'], inplace=True)

Best feature selection


x_train_selector.isna().sum().sum()  # 0 — no missing values remain



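The selector fitting itself is abbreviated in this excerpt. As a minimal sketch of how a `selected_name_columns` list can be produced with `SelectKBest` (the `f_classif` scorer, `k=1`, and the toy data are assumptions for illustration, not the project's actual configuration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
# Binary target driven entirely by the informative feature
y = (X["informative"] > 0).astype(int)

# Keep the k best-scoring features and recover their names
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected_name_columns = X.columns[selector.get_support()].tolist()
print(selected_name_columns)
```

`get_support()` returns a boolean mask over the input columns, which is what lets us carry the selected column *names* back into the dataframe-based pipeline below.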
m_news.reset_index(drop=True, inplace=True)
m_news = pd.concat((m_news[selected_name_columns], m_news[exclude]), axis=1)

nb_dim = 5
vec_dim = 300

# Reduce the 300-dimensional channel word vectors to nb_dim components via PCA
sub_vec = m_news.loc[:, "WChanel 0":"WChanel 299"]
pca = PCA(n_components=nb_dim)
pca.fit(sub_vec)
explanation_coefs = pca.explained_variance_ratio_
vec_pca_values = pca.transform(sub_vec)
print(f"Explained information: {round(np.sum(explanation_coefs), 2)}")

# Name the reduced components "WChanel 0" … "WChanel 4"
df_reduced_vec = pd.DataFrame(vec_pca_values)
dic_vec_labels = {i: f"WChanel {i}" for i in range(nb_dim)}
df_reduced_vec.rename(columns=dic_vec_labels, inplace=True)

# Replace the original 300 vector columns with the reduced ones
vec_labels = [f"WChanel {i}" for i in range(vec_dim)]
m_news.drop(columns=vec_labels, inplace=True)
m_news = pd.concat((m_news, df_reduced_vec), axis=1)
m_news.head(2)

if export:
    m_news.to_csv("m_news.csv")